Europäisches Patentamt
European Patent Office
Office européen des brevets



Publication number: 0 481 231 A2



EUROPEAN PATENT APPLICATION

Application number: 91115808.7
Int. Cl.5: G06F 11/00

Date of filing: 18.09.91



Priority: 17.10.90 US 599178

Date of publication of application: 22.04.92 Bulletin 92/17

Designated Contracting States: DE FR GB

Applicant: International Business Machines Corporation
Old Orchard Road
Armonk, N.Y. 10504 (US)

Inventor: Smith, Donald M.
14723 Mockingbird Drive
Germantown, Maryland 20874 (US)

Representative: Jost, Ottokarl, Dipl.-Ing.
IBM Deutschland GmbH Patentwesen und Urheberrecht
Schönaicher Strasse 220
W-7030 Böblingen (DE)






A method and system for increasing the operational availability of a system of computer programs
operating in a distributed system of computers.



A system and method are disclosed to organize
computer software operating in a distributed system
of computers, so that its recovery from a failure of
either the software or the hardware occurs before the
failure becomes operationally visible. The software is
made to recover from the failure and reprocess or
reject the stimulus such that the result is available to
the user of the system within the specified response
time for that type of stimulus.

The invention disclosed broadly relates to data
processing systems and more particularly relates
to systems and methods for enhancing fault
tolerance in data processing systems.

Operational availability is defined as follows: "If
a stimulus to the system is processed by the
system and the system produces a correct result
within an allocated response time for that stimulus,
and if this is true for all stimuli, then the system's
availability is 1.0."

It is recognized that there are many contributors
to high operational availability: (1) failures in
both the hardware system and the software system
must be detected with sufficiently high coverage to
meet requirements; (2) the inherent availability of
the hardware (in terms of simple numerical
availability of its redundancy network), including
internal and external redundancy, must be higher than
the system's required operational availability; and (3)
failures in the software must not be visible to or
adversely affect operational use of the system. This
invention addresses the third of these contributors
with the important assumption that software failures,
due to design errors and to hardware failures, will
be frequent and hideous.

The prior art has attempted to solve this type
of problem by providing duplicate copies of the
entire software component on two or more individual
processors, and providing a means for communicating
the health of the active processor to the
standby processor. When the standby processor
determines, through monitoring the health of the
active processor, that the standby processor must
take over operations, the standby processor initializes
the entire software component stored therein
and the active processor effectively terminates its
operations. The problem with this approach is that
the entire system is involved in each recovery
action. As a result, recovery times tend to be long,
and failures in the recovery process normally render
the system inoperable. In addition, if the redundant
copies of the software systems are both normally
operating (one as a shadow to the other),
then the effect of common-mode failures is extreme
and also affects the whole system.

It is therefore an object of the invention to 
increase the operational availability of a system of 
computer programs operating in a distributed sys- 
tem of computers. 

It is another object of the invention to provide
fault tolerance in a system of computer programs
operating in a distributed system of computers,
having a high availability and fast recovery time.

It is still a further object of the invention to
provide improved operational availability in a system
of computer programs operating in a distributed
system of computers, with less software complexity
than was required in the prior art.



These and other objects, features and advan- 
tages of the invention are accomplished as follows. 
This invention provides a mechanism to organize 
the computer software in such a way that its recov- 
ery from a failure (of either itself or the hardware) 
occurs before the failure becomes operationally 
visible. In other words, the software is made to 
recover from the failure and reprocess or reject the 
stimulus so that the result is available to the user of 
the system within the specified response time for 
that type of stimulus. 

A software structure that will be referred to as 
an operational unit (OU), and a related availability 
management function (AMF) are the key compo- 
nents of the invention. The OU and portions of the 
AMF are now described. The OU concept is imple- 
mented by partitioning as much of the system's 
software as possible into independent self-contained
modules whose interactions with one another are
via a network server. A stimulus enters the system
and is routed to the first module in its thread, and 
from there traverses all required modules until an 
appropriate response is produced and made avail- 
able to the system's user. 

Each module is in fact two copies of the code
and data-space of the OU. One of the copies,
called the Primary Address Space (PAS), maintains
actual state data. The other copy, called the Standby
Address Space (SAS), runs in a separate processor,
and may or may not maintain actual state
data, as described later.

The Availability Management Function (AMF) 
controls the allocation of PAS and SAS compo- 
nents to the processors. When the AMF detects an 
error, a SAS becomes PAS and the original PAS is 
terminated. Data servers in the network are also 
informed of the change so that all communication 
will be redirected to the new PAS. In this fashion, 
system availability can be maintained. 

These and other objects, features and advantages
of the invention will be more fully appreciated
with reference to the accompanying figures.
Fig. 1     is a timing diagram illustrating response
           time allocation.
Fig. 2A    is a schematic diagram of several
           operational units in a network.
Fig. 2B    is an alternate diagram of the operational
           unit architecture illustrating the use of a
           data server OU by several applications.
Fig. 2C    is an illustration of the operational
           unit architecture.
Fig. 2D    is an illustration of the operational
           unit architecture showing an example of the
           allocation of general operational units
           across three groups.
Figs. 3A-F show the operational unit at various
           stages during the initialization, operation,
           reconfiguration and recovery functions.

Fig. 1 shows a timing diagram illustrating response
time allocation. The required response time
(Tmax) between a stimulus input and its required
output is suballocated so that a portion of it is
available for normal production of the response
(Tnormal), a portion is available for unpredicted
resource contention (Tcontention), and a portion is
available for recovery from failures (Trecovery).
The first of these, Tnormal, is divided among the
software and hardware elements of the system in
accordance with their processing requirements.
This allocation will determine the performance
needed by each hardware and software component
of the system. The second portion, Tcontention, is
never allocated. The last portion, Trecovery, is
made sufficiently long that it includes time for error
detection (including omission failures), hardware
reconfiguration, software reconfiguration, and
reproduction of the required response. A rule of thumb is to
divide the required response time Tmax in half and
subdivide the first half so that one quarter of the
required response time is available for normal
response production, Tnormal, and the second quarter
is available for unpredicted resource contention,
Tcontention. The second half of the response time,
Trecovery, is then available for failure detection
and response reproduction.
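
The rule of thumb above can be illustrated with a short Python sketch. It is not part of the disclosure: the names ResponseBudget and rule_of_thumb_budget are hypothetical, and the 2.0 second requirement is an arbitrary example value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseBudget:
    """Suballocation of the required response time Tmax (seconds)."""
    t_normal: float      # normal production of the response
    t_contention: float  # unpredicted resource contention (never allocated further)
    t_recovery: float    # failure detection and response reproduction

    @property
    def t_max(self) -> float:
        return self.t_normal + self.t_contention + self.t_recovery


def rule_of_thumb_budget(t_max: float) -> ResponseBudget:
    """Split Tmax into one quarter, one quarter and one half."""
    if t_max <= 0:
        raise ValueError("t_max must be positive")
    return ResponseBudget(
        t_normal=t_max / 4,
        t_contention=t_max / 4,
        t_recovery=t_max / 2,
    )


# Example value only: a stimulus type with a 2.0 s required response time.
budget = rule_of_thumb_budget(2.0)
```

Under this split, detection, reconfiguration and reprocessing together must fit within budget.t_recovery for a failure to remain operationally invisible.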

The specific problem addressed by this inven- 
tion is how to reduce the time required for hard- 
ware and software reconfiguration, Trecovery, of a 
complex system to a small fraction of Tmax. This 
problem is solved by a software structure that will 
be referred to as an operational unit (OU), and a 
related availability management function (AMF). 
The OU and portions of the AMF are now de- 
scribed. 

Referring to Fig. 2A, the OU concept is imple- 
mented by partitioning as much of the system's 
software as possible into independent self-contain- 
ed modules or OUs 10 whose interactions with one 
another are via a network server 12. None of these 
modules shares data files with any other module, 
and none of them assumes that any other is (or is 
not) in the same machine as itself. A stimulus 14 
enters the system and is routed to the first module 
in its thread and from there traverses all requested 
modules until an appropriate response is produced 
and made available to the system's user 16. 

Each module maintains all necessary state
data for its own operation. If two or more modules
require access to the same state knowledge then:
(1) each must maintain the knowledge; (2) updates
to that knowledge must be transmitted among them
as normal processing transactions; or (3) each
must be tolerant of possible differences between its
knowledge of the state, and the knowledge of the
others. This tolerance may take on several forms
depending on the application, and may include
mechanisms for detecting and compensating (or
correcting) for differences in state. If two modules
10 absolutely must share a common data base for
performance or other reasons, then they are not
"independent" and are combined into a single
module for the purpose of this invention.
It is acceptable for one module 10* to perform
data server functions for multiple other modules (as
shown in Fig. 2B) provided that those other modules
10 can operationally compensate for failure and
loss of the server function. Compensate means that
they continue to provide their essential services to
their clients, and that the inability to access the
common state does not result in unacceptable
queuing or interruption of service. Clearly this
constrains the possible uses of such common servers.

Finally, a module 10 must provide predefined
degraded or alternative modes of operation for any
case where another module's services are used,
but where that other module is known to be
indefinitely unavailable. An exception to this rule is that if
a server is part of a processor (if it deals with the
allocation or use of processor resources), then the
module may have unconditional dependencies on
the server. In this case, failure of the server is
treated by the availability management function as
failure of the processor. If a module conforms to all
of the above conditions, then it becomes an OU
through the application of the following novel
structuring of its data, its logic and its role within the
system. This structure is shown in Fig. 2C.

Two complete copies of the module are loaded
into independent address spaces 20, 20' in two
separate computers 22, 22'. One of these copies
is known by the network server as the Primary
Address Space (PAS) and is the one to which all
other modules' service requests are directed. The
other of these copies is called the Standby Address
Space (SAS), and is known only by the PAS.
It is invisible to all other modules. The PAS sends
application dependent state data to the SAS so that
the SAS is aware of the state of the PAS. Whether
the interface between the PAS and the SAS of an
OU is a synchronous commit or is an unsynchronized
interface is not limited by this invention,
and is a trade-off between the steady-state
response time, and the time required for a newly
promoted PAS to synchronize with its servers and
clients. This trade-off is discussed in strategy #1
below.

The PAS maintains the state necessary for
normal application processing by the module. The
SAS maintains sufficient state knowledge so that it
can transition itself to become the PAS when and if
the need should arise. The amount of knowledge
this requires is application dependent, and is
beyond the scope of this invention.

Both the PAS and the SAS of an OU maintain
open sessions with all servers of the OU. The SAS
sessions are unused until/unless the SAS is
promoted to PAS. When the SAS is directed by the
AMF to assume the role of PAS, it assures that its
current state is self-consistent (a failure may have
resulted from the PAS sending only part of a
related series of update messages), and then
communicates with clients and servers of the PAS to
establish state agreement with them. This may
result in advancing the SAS's state knowledge or in
rolling back the state knowledge of the clients and
servers. Any rolling back must be recovered by
reprocessing any affected stimuli, and/or by informing
users that the stimuli have been ignored.
Simultaneously with this process, the network server is
updated so that it directs all new or queued service
requests to the SAS instead of the PAS. This last
action constitutes promotion of the SAS to the
position of PAS, and is followed by starting up a
new SAS in a processor other than the one
occupied by the new PAS.
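
The promotion sequence just described might be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the disclosed implementation: all class and function names are hypothetical, self-consistency is reduced to discarding an incomplete update series, and state agreement with clients and servers is only marked by a comment.

```python
from dataclasses import dataclass, field


@dataclass
class AddressSpace:
    processor: str
    state: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)  # incomplete update series

    def make_self_consistent(self):
        # Assumption: a partially received series is simply discarded.
        self.pending.clear()


@dataclass
class NetworkServer:
    routes: dict = field(default_factory=dict)  # OU name -> processor of PAS


@dataclass
class OperationalUnit:
    name: str
    pas: AddressSpace
    sas: AddressSpace


def promote_sas(ou, server, healthy_processors):
    """Promote the SAS of `ou` to PAS, then start a new SAS elsewhere."""
    new_pas = ou.sas
    new_pas.make_self_consistent()
    # State agreement with clients and servers would be established here.
    server.routes[ou.name] = new_pas.processor  # redirect new/queued requests
    spare = next(
        (p for p in healthy_processors if p != new_pas.processor), None
    )
    if spare is None:
        raise RuntimeError("no processor available for a new SAS")
    ou.pas = new_pas
    ou.sas = AddressSpace(processor=spare, state=dict(new_pas.state))
    return ou


# Example: the PAS on "cpu-a" has failed; "cpu-b" and "cpu-c" are healthy.
server = NetworkServer({"tracker": "cpu-a"})
ou = OperationalUnit(
    "tracker",
    pas=AddressSpace("cpu-a", {"n": 3}),
    sas=AddressSpace("cpu-b", {"n": 3}, ["half-update"]),
)
promote_sas(ou, server, ["cpu-b", "cpu-c"])
```

Only the single OU and the network server's routing entry are touched, which is what keeps the unit of recovery small.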

Several strategies for maintenance of standby
data by the PAS are relevant to the invention. They
are summarized as follows.

Strategy #1. The SAS may retain a complete
copy of the state of the PAS. If the copy is committed
to the SAS before the PAS responds to its
stimulus, then the restart recovery will be very fast,
but the response time for all stimuli will be the
longest. This approach is superior if response time
requirements allow it.

Strategy #2. The SAS may retain a "trailing"
copy of the state of the PAS. Here, the PAS sends
state updates as they occur, at the end of processing
for a stimulus, or batches them for several
stimuli. The SAS trails the PAS state in time and
must therefore be concerned with state consistency
within its data and between itself and its servers
and clients. This yields very fast steady-state
response time but requires moderate processing during
failure recovery.

Strategy #3. The SAS may retain knowledge of
the stimuli currently in process by the PAS so that
at failure, the exact state of the relationships
between the PAS and its clients and servers is known
by the SAS. This requires commitment of beginning
and end of transactions between the PAS and
the SAS, but reduces the inter-OU synchronization
required during failure recovery.

These mechanisms may be used alone or in
various combinations by an OU. The determination
of which to use is a function of the kinds of stimuli,
the response time requirements, and the nature of
the state retained by the application.
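
The difference between strategy #1 and strategy #2 can be made concrete with a small sketch. It is illustrative only: the names Strategy, Standby and Primary are hypothetical, updates are modelled as key/value pairs, batching by a fixed count is an assumption, and strategy #3 is not modelled.

```python
from enum import Enum


class Strategy(Enum):
    COMPLETE_COPY = 1  # commit each update to the SAS before responding
    TRAILING_COPY = 2  # send updates later, here batched per N stimuli


class Standby:
    def __init__(self):
        self.state = {}

    def apply(self, updates):
        self.state.update(updates)


class Primary:
    def __init__(self, standby, strategy, batch_size=2):
        self.standby = standby
        self.strategy = strategy
        self.batch_size = batch_size  # used only by TRAILING_COPY
        self.state = {}
        self._unsent = {}
        self._handled = 0

    def handle(self, stimulus_id, key, value):
        """Process one stimulus and return its response."""
        self.state[key] = value
        if self.strategy is Strategy.COMPLETE_COPY:
            # The SAS is current before the response leaves the PAS.
            self.standby.apply({key: value})
        else:
            # The SAS trails the PAS until the batch is flushed.
            self._unsent[key] = value
            self._handled += 1
            if self._handled % self.batch_size == 0:
                self.standby.apply(self._unsent)
                self._unsent = {}
        return f"done:{stimulus_id}"


sync = Primary(Standby(), Strategy.COMPLETE_COPY)
sync.handle(1, "a", 1)

trail = Primary(Standby(), Strategy.TRAILING_COPY, batch_size=2)
trail.handle(1, "a", 1)
lag_after_first = dict(trail.standby.state)  # SAS has not seen "a" yet
trail.handle(2, "b", 2)                      # batch of two is flushed
```

If the trailing PAS failed after the first stimulus, the promoted SAS would have to reconcile the missing update with clients and servers, which is the recovery cost that strategy #2 accepts in exchange for faster steady-state responses.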



The Availability Management Function (AMF) 

The characteristics just described comprise an 
OU, but do not by themselves achieve availability 
goals. Availability goals are achieved by combining 
this OU architecture with an AMF that controls the 
state of all OU's in the system. The AMF has three 
components, each with its own roles in the main- 
tenance of high availability of the system's OU's. 
The relationship between an OU and the AMF is 
illustrated in Figs. 3A through 3F and is described 
below. 

The most important AMF function is that of 
group manager. A group is a collection of similarly 
configured processors that have been placed into 
the network for the purpose of housing a predesig- 
nated set of one or more OU's. Each group in the 
system (if there is more than one) is managed
independently of all other groups. In Fig. 2D, three 
groups are shown. Here, the OU's residing in each 
group (rather than the processors) are shown. 

The number of processors in each group, and 
the allocation of OU PAS and SAS components to 
those processors may be widely varied in accor- 
dance with the availability requirement, the power 
of each processor, and the processing needs of 
each PAS and SAS component. The only constraint 
imposed by the invention is that the PAS and the 
SAS of any single OU must be in two different 
processors if processor failures are to be guarded 
against. 

Group Management. Referring to Fig. 3A, the
group manager resides in one of the processors of 
a group (the processor is determined by a protocol 
not included in this invention). It initializes the 
group for cold starts, and reconfigures the group 
when necessary for failure recovery. Within the 
group, each OU's PAS and SAS exist on separate 
processors, and sufficient processors exist to meet 
inherent availability requirements. The group man- 
ager monitors performance of the group as a unit 
and coordinates the detection and recovery of all 
failures within the group that do not require atten- 
tion at a system level. This coordination includes 
the following: 

1. Commanding the takeover of an OU's func- 
tional responsibilities by the SAS when it has 
been determined that a failure has occurred in 
either the PAS, or in the processor or related 
resource necessary to the operation of the PAS. 

2. Initiating the start-up of a new address space 
to house a new SAS after a prior SAS has been 
promoted to PAS.

3. Updating the network server's image of the 
location of each OU as its PAS moves among 
the processors of a group in response to failures 
or scheduled shutdown of individual group
processors.

Errors detected by the group manager include
processor failure (by heartbeat protocols between
group members), and failures of an OU that affect
both its PAS and SAS, for example, those caused
by design errors in the communication between
them or the common processing of standby/backup
state data. The mechanisms for error detection are
beyond the scope of the invention.

Local Management. Group level support of
the OU architecture relies on the presence of certain
functions within each processor of the group.
These functions are referred to as the AMF's local
manager 32. The local manager is implemented
within each processor as an extension of the
processor's control program and is responsible for
detecting and correcting failures that can be handled
without intervention from a higher (group or
above) level. The local manager maintains heartbeat
protocol communications with each OU PAS
or SAS in its processor to watch for abnormalities
in their performance. It also receives notification
from the operating system of any detected machine
level hardware or software problem. Any
problem that cannot be handled locally is forwarded
to the group manager for resolution.
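
A heartbeat check of the kind the local manager performs might look like the sketch below. The disclosure leaves error detection mechanisms out of scope, so everything here is an assumption for illustration: the names HeartbeatMonitor and local_check, the fixed timeout, the explicit time argument, and the policy that decides what can be handled locally.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat of each PAS or SAS on one processor."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence tolerated
        self.last_seen = {}     # address space name -> time of last beat

    def beat(self, name, now):
        self.last_seen[name] = now

    def overdue(self, now):
        """Names whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


def local_check(monitor, now, can_handle_locally):
    """Split overdue address spaces into locally handled and forwarded."""
    handled, forwarded = [], []
    for name in monitor.overdue(now):
        if can_handle_locally(name):
            handled.append(name)    # corrected by the local manager
        else:
            forwarded.append(name)  # sent to the group manager
    return handled, forwarded


monitor = HeartbeatMonitor(timeout=1.0)
monitor.beat("tracker/PAS", now=0.0)
monitor.beat("logger/SAS", now=0.2)
monitor.beat("display/SAS", now=0.8)

# Hypothetical policy: a silent SAS is restarted locally, while a silent
# PAS needs a takeover and is therefore forwarded to the group manager.
handled, forwarded = local_check(
    monitor, now=1.5, can_handle_locally=lambda name: name.endswith("/SAS")
)
```

Passing the current time in explicitly keeps the sketch deterministic; a real monitor would read a clock.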

Global Management. The isolation and correction
of system level problems and problems
associated with the network fall to the AMF's
global manager 34. The global manager 34 correlates
failures and states at the lower levels to
detect and isolate errors in network behavior, group
behavior, and response times of threads involving
multiple processors. It also interacts with human
operators of the system to deal with failures that
the automation cannot handle. The global manager
is designed to operate at any station in the network,
and is itself an OU. The movement of the
global manager 34 OU from one group to another,
if necessary, is initiated and monitored by the
human operator using capabilities contained in the
local manager at each processor station.

Network Management. The network manager
is a part of the AMF global manager 34. Its
functionality and design are configuration dependent, and
are beyond the scope of this invention.

Figs. 3A-F show the functionality for the AMF
during initialization, operation, reconfiguration and
recovery of the system. In Fig. 3A, PAS and SAS
are loaded and initialized. OU locations are entered
in the network server. Fig. 3B shows the
synchronization of the PAS and SAS with clients and
servers to establish consistency of states. Fig. 3C
shows steady state operation with PAS responding
to a stimulus and outputting a response. The SAS
is kept updated by the PAS.

Fig. 3D shows what happens when the PAS
fails. The old SAS is promoted to new PAS and a
new SAS is loaded and initialized. The network
servers are also updated with the new location of
the OU. Fig. 3E shows re-synchronization of the
new PAS with the clients and servers. At steady
state (Fig. 3F), the new PAS is responding to
stimuli as normal.

It is the combination of the OU structure and 
the recovery functions of the AMF that constitutes
this invention. By comparison, the contemporary 
strategy and mechanisms of software 
failover/switchover deal with units of entire proces- 
sors rather than with smaller units of software as in 
this invention. Because the unit of failure and re- 
covery achieved by this invention is small by com- 
parison, the time for recovery is also small. Fur- 
thermore, the recovery from processor level fail- 
ures can be accomplished in small steps spread 
across several processors (as many as are used to 
contain SAS's for the PAS's contained in the failed 
processor). This is crucial to maintaining continu- 
ous high availability operation in large real time 
systems. 
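
How recovery from a processor failure spreads across several processors can be shown with a small planning sketch. It is illustrative only: the function name takeover_plan, the placement table and the processor names are hypothetical.

```python
def takeover_plan(placement, failed):
    """Plan recovery from the failure of one processor.

    placement maps each OU name to (PAS processor, SAS processor).
    Returns the promotions each surviving processor must perform and
    the OUs that need a new SAS started afterwards.
    """
    promotions = {}     # surviving processor -> OUs whose SAS it promotes
    needs_new_sas = []  # OUs left without a standby
    for ou, (pas_proc, sas_proc) in sorted(placement.items()):
        if pas_proc == failed:
            promotions.setdefault(sas_proc, []).append(ou)
            needs_new_sas.append(ou)
        elif sas_proc == failed:
            needs_new_sas.append(ou)
    return promotions, needs_new_sas


# Example placement across three processors; "p1" then fails.
placement = {
    "radar": ("p1", "p2"),
    "display": ("p1", "p3"),
    "logger": ("p2", "p1"),
    "archive": ("p2", "p3"),
}
promotions, needs_new_sas = takeover_plan(placement, failed="p1")
```

In this example the two PAS's lost with p1 are taken over by two different processors, each performing one small promotion, rather than one standby machine restarting the entire software component.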

Claims 

1. A method for increasing the operational avail- 
ability of a system of computer programs op- 
erating in a distributed system of computers, 
comprising: 

dividing a computer program into a plurality of 
functional modules; 

loading a first copy of a functional module into 
a first processor's address space and locating 
a second copy of said functional module into a 
second processor's address space; 

said first processor executing said first func- 
tional module to send application dependent 
state data to said second processor where it is 
received by said second functional module ex- 
ecuting on said second processor; 

said first processor executing said first module, 
maintaining a normal application processing 
state and said second processor executing 
said second module, maintaining a secondary 
state knowledge sufficient to enable it to be- 
come a primary functioning module; 

said first processor executing said first module 
maintaining open sessions with a plurality of 
servers connected therewith in a network and 
said second processor executing said second 
module maintaining a plurality of open ses- 
sions with all of said servers in said network; 

said second functional module, in response to
a stimulus requiring it to assume the role of
said first functional module, checking that its
current state is consistent with the current
state of said first functional module, followed
by said second module, then communicating
with said servers in said network to establish
synchronization with the state of said servers;

all clients and servers connected in said network
responding to said second module assuming
the role of said first module, by directing
all new or queued service requests to said
second module instead of to said first module;

whereby said second module assumes the role
of said first module in performing primary address
space operations.

2. The method of claim 1 wherein said first module
and said second module communicate
synchronously.

3. The method of claim 1 wherein said first module
and said second module communicate
asynchronously.

4. The method of claim 1 wherein said second 
module retains a complete copy of the state of 
said first module. 

5. The method of claim 1 wherein said second 
module retains a trailing copy of the state of 
said first module. 

6. The method of claim 1 wherein said second
module retains knowledge of the stimuli currently
in process by said first module.